url <- 'https://raw.githubusercontent.com/eitanaka/DATS6101_Final_Project_Team2/main/data_set/geo_socio_health_df.csv'
master_df<- read_csv(url)
master_df <- master_df[,-1]
# To rename some columns in a data frame to make them more readable and easier to work with.This is done by using the "colnames" function to first select the column names that match the original names, and then assigning new names.
colnames(master_df)[colnames(master_df) == "MT_Never Married"] <- "mt.nev.mar"
colnames(master_df)[colnames(master_df) == "MT_Now married"] <- "mt.now.mar"
colnames(master_df)[colnames(master_df) == "Total Population"] <- "tot.pop"
colnames(master_df)[colnames(master_df) == "EA_Less than high school graduate"] <- "ea.less.hs.deg"
colnames(master_df)[colnames(master_df) == "EA_High school graduate"] <- "ea.hs.deg"
colnames(master_df)[colnames(master_df) == "EA_college or associate's degree"] <- "ea.col.ass.deg"
colnames(master_df)[colnames(master_df) == "EA_Bachelor's degree"] <- "ea.ba.deg"
colnames(master_df)[colnames(master_df) == "EA_Graduate or professional degree"] <- "ea.grad.prof.deg"
# Create a new data only including numerical variable
numeric_vars <- sapply(master_df, is.numeric)
num_df <- master_df[, numeric_vars]
In project 1, we investigated how health risk behaviors are associated with health outcomes and health status.We have 11 different datasets that we have made used of. We have tried importing all the datasets from the google drive where all datasets has been uploaded.Now, we are trying to examine whether there is any significant influence on mental health due to lifestyle, socio-economic factors, and covid-19 in the United States. Census tracts will be used in calculating and analyzing all the different datasets. This research which is going to be very interesting to look at identifying what factors have significantly impacted mental health and well-being in the United States in 2020.
Need edited (Describe about 4 data set and how we create) We have used of 10 different datasets and the one which we are going to use from the previous Data Science project1.
SQ1: What factors do we observed highly associated with depression and poor mental health rates? SQ 2: Using those factors, how accurately can we infer depression and poor mental health rates in a given tract?
In this project, we perform EDA on a dataset containing health and socioeconomic factors for different counties in the United States. The purpose of the analysis is to investigate the relationship between these factors and two health outcomes: depression rate and poor mental health rate. We check the data quality, including missing values, outliers, and inconsistencies. We then examine the distributions of the variables and their relationships using visualizations such as histograms, scatterplots, and correlation matrices. The EDA reveals that some variables are highly correlated with each other, indicating the presence of multicollinearity. Overall, the EDA helps us to understand the data, identify potential issues, and guide our subsequent analysis. The insights gained from this analysis have important implications for public health policy and future research.
The data structure of the dataset is a table with 60,682 rows and 57 columns. The columns represent various variables, including health indicators, demographic characteristics, and economic indicators. The data types of the variables are numeric, except for the one-factor variable (index_apr20). The dataset has also omitted some rows due to missing data, as indicated by the "na.action" attribute. The dataset is a comprehensive health and demographic information collection for a large population.
str(num_df)
## tibble [60,962 × 57] (S3: tbl_df/tbl/data.frame)
## $ ACCESS2 : num [1:60962] 15 19.9 12.5 18 19.6 16.7 16.5 21.1 12.7 17.8 ...
## $ BINGE : num [1:60962] 15.8 14.1 15.2 14.9 14.9 15.5 14.9 12.5 16.6 14.9 ...
## $ CHECKUP : num [1:60962] 74.1 76.5 75.5 74.6 74 74.3 75.2 78.3 73.9 75.5 ...
## $ DEPRESSION : num [1:60962] 25.9 23.9 24 26.9 28.5 26.2 25.9 24.5 24.6 25.9 ...
## $ DIABETES : num [1:60962] 10.7 13.4 10.2 12.8 12.8 11.3 12.8 17.4 9.7 13.8 ...
## $ LPA : num [1:60962] 26.4 32.1 23 31.1 33.4 28.2 29.8 37.2 22.8 29.6 ...
## $ MHLTH : num [1:60962] 16.4 17.3 14 17.9 19.1 17 17 17.7 15.1 16.9 ...
## $ OBESITY : num [1:60962] 37 43.9 33.2 41.1 41 37.9 40.3 46.1 35.7 36.2 ...
## $ PHLTH : num [1:60962] 11.3 12.1 9.8 13.5 14.5 11.6 12.8 15.4 9.8 13 ...
## $ SLEEP : num [1:60962] 36.9 43.4 33.4 39.5 39.8 38.1 39.2 43.8 35.8 37.6 ...
## $ STROKE : num [1:60962] 3.1 3.7 3 3.8 4 3.3 3.8 5.3 2.7 4 ...
## $ mt.nev.mar : num [1:60962] 23.7 39.5 20.9 27.5 36.9 ...
## $ mt.now.mar : num [1:60962] 54.8 31.8 63.8 52.9 37 ...
## $ MT_Divorces : num [1:60962] 11.97 18.56 10.09 9.62 18.61 ...
## $ MT_Separated : num [1:60962] 1.39 2.62 0 1.17 1.61 ...
## $ MT_Widowed : num [1:60962] 8.11 7.54 5.27 8.87 5.9 ...
## $ ea.less.hs.deg : num [1:60962] 14.29 10.59 6.23 12.86 17.91 ...
## $ ea.hs.deg : num [1:60962] 33.8 50.3 28.1 31.2 38.4 ...
## $ ea.col.ass.deg : num [1:60962] 27 23.2 26.4 34.3 28 ...
## $ ea.ba.deg : num [1:60962] 15.4 12.9 22.8 12.2 10.8 ...
## $ ea.grad.prof.deg : num [1:60962] 9.4 2.97 16.43 9.5 4.93 ...
## $ MI_Estimate : num [1:60962] 26504 26173 37529 20669 24181 ...
## $ tot.pop : num [1:60962] 1941 1757 3539 3536 3562 ...
## $ CT_<10 : num [1:60962] 17.68 5.65 13.98 21.38 19.81 ...
## $ CT_10-14 : num [1:60962] 27.6 17.54 8.91 13.07 12.08 ...
## $ CT_15-19 : num [1:60962] 7.02 3.48 13.78 16.25 15.53 ...
## $ CT_20-24 : num [1:60962] 14.29 7.68 14.37 10.42 20.91 ...
## $ CT_25-29 : num [1:60962] 5.57 0.87 8.26 5.39 6.07 ...
## $ CT_30-34 : num [1:60962] 18.8 39.9 22.1 18.6 16.1 ...
## $ CT_35-44 : num [1:60962] 7.02 14.2 7.87 7.51 3.59 ...
## $ CT_45-59 : num [1:60962] 1.69 8.41 3.64 3.27 4.55 ...
## $ CT_>60 : num [1:60962] 0.363 2.319 7.087 4.152 1.38 ...
## $ ES_Total_labor_force : num [1:60962] 54.7 50.3 54.5 44.9 58.9 ...
## $ ES_Civilian_labor_force : num [1:60962] 54.7 49.4 54.2 43.5 58.3 ...
## $ ES_Civilian_labor_force_employed : num [1:60962] 53.5 47.4 52.9 42.3 53.5 ...
## $ ES_Civilian_labor_force_unemployed: num [1:60962] 1.16 2 1.27 1.21 4.78 ...
## $ ES_Armed_Forces : num [1:60962] 0 0.828 0.327 1.423 0.56 ...
## $ ES_Not_in_labor_force : num [1:60962] 45.3 49.7 45.5 55.1 41.1 ...
## $ land : num [1:60962] 3.79 1.29 2.46 3.1 8.65 ...
## $ urban : num [1:60962] 1594 2170 4386 3595 2505 ...
## $ poverty : num [1:60962] 11.34 17.88 2.85 21.58 30.5 ...
## $ no.ins : num [1:60962] 9.26 9.37 2.82 14.87 16.23 ...
## $ disab : num [1:60962] 17.6 17.1 19.6 25.7 24.7 ...
## $ no.comp : num [1:60962] 11.23 17.36 7.6 8.45 9.54 ...
## $ broad&comp : num [1:60962] 80.9 79.2 86 87.6 85.1 ...
## $ no.eng : num [1:60962] 1.83 0.56 0.98 0.68 0.99 0 0 0 0 0 ...
## $ sing.mom : num [1:60962] 14.12 17.25 7.69 9.38 27.48 ...
## $ live.alone : num [1:60962] 22.9 37.3 28.4 18.7 28.5 ...
## $ pub.assist : num [1:60962] 0 1.11 1.22 0.68 4.51 2.16 1.27 0.45 1.47 0 ...
## $ no.phone : num [1:60962] 2.88 2.36 2.5 0.45 0.78 0.69 0.49 4.91 0.88 0.71 ...
## $ no.plumb : num [1:60962] 0 2 0.86 3.47 0 ...
## $ married.kid : num [1:60962] 36 49.2 36.3 43.6 50.1 ...
## $ hhd.no.comp : num [1:60962] 18.3 27.8 10.4 11.6 13.1 ...
## $ hhd.only.phone : num [1:60962] 6.01 5.7 4.27 5.41 3.59 ...
## $ hhd.no.int : num [1:60962] 26.5 30.5 17.5 17.2 17.7 ...
## $ hhd.broad : num [1:60962] 59.7 55.1 74.6 65.4 66 ...
## $ index_apr20 : num [1:60962] 0.913 0.913 0.913 0.913 0.913 ...
The table displays various health and socio-economic indicators for a population. The mean values for several health indicators, such as depression and obesity, are high, indicating potential health concerns in the population. Education levels vary, with a mean of 12.2% having less than a high school degree and 28.8% having some college or an associate’s degree. The poverty rate is 15.3%, and 9.1% of the population does not have health insurance. Additionally, 13.3% of households do not have a computer and 28.1% of individuals live alone.
summary(num_df)
## ACCESS2 BINGE CHECKUP DEPRESSION DIABETES
## Min. : 2.6 Min. : 2.7 Min. :49.7 Min. : 8.5 Min. : 0.6
## 1st Qu.: 9.8 1st Qu.:14.7 1st Qu.:70.6 1st Qu.:17.8 1st Qu.: 8.5
## Median :13.3 Median :16.7 Median :74.8 Median :20.3 Median :10.3
## Mean :15.5 Mean :16.8 Mean :74.0 Mean :20.4 Mean :10.9
## 3rd Qu.:19.2 3rd Qu.:18.7 3rd Qu.:77.7 3rd Qu.:22.9 3rd Qu.:12.6
## Max. :64.9 Max. :36.6 Max. :90.8 Max. :37.8 Max. :42.2
##
## LPA MHLTH OBESITY PHLTH SLEEP
## Min. : 7.8 Min. : 6.1 Min. :11.5 Min. : 2.3 Min. :19.8
## 1st Qu.:18.7 1st Qu.:13.1 1st Qu.:28.0 1st Qu.: 8.4 1st Qu.:30.7
## Median :23.7 Median :15.0 Median :33.1 Median :10.3 Median :33.5
## Mean :24.6 Mean :15.1 Mean :33.0 Mean :10.8 Mean :34.0
## 3rd Qu.:29.5 3rd Qu.:16.9 3rd Qu.:37.7 3rd Qu.:12.7 3rd Qu.:36.7
## Max. :63.7 Max. :33.0 Max. :63.9 Max. :31.3 Max. :54.4
##
## STROKE mt.nev.mar mt.now.mar MT_Divorces MT_Separated
## Min. : 0.20 Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 2.40 1st Qu.: 24.6 1st Qu.: 37.7 1st Qu.: 7.9 1st Qu.: 0.6
## Median : 3.00 Median : 31.5 Median : 48.2 Median : 10.7 Median : 1.5
## Mean : 3.19 Mean : 34.2 Mean : 46.7 Mean : 11.2 Mean : 1.9
## 3rd Qu.: 3.70 3rd Qu.: 41.4 3rd Qu.: 57.1 3rd Qu.: 13.9 3rd Qu.: 2.7
## Max. :20.50 Max. :100.0 Max. :100.0 Max. :100.0 Max. :41.8
## NA's :17 NA's :17 NA's :17 NA's :17
## MT_Widowed ea.less.hs.deg ea.hs.deg ea.col.ass.deg ea.ba.deg
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 3.8 1st Qu.: 4.9 1st Qu.:19.6 1st Qu.: 23.4 1st Qu.: 10.8
## Median : 5.6 Median : 9.4 Median :28.4 Median : 29.1 Median : 17.2
## Mean : 6.1 Mean :12.2 Mean :27.8 Mean : 28.8 Mean : 19.1
## 3rd Qu.: 7.7 3rd Qu.:16.6 3rd Qu.:36.1 3rd Qu.: 34.5 3rd Qu.: 26.3
## Max. :57.9 Max. :83.0 Max. :88.8 Max. :100.0 Max. :100.0
## NA's :17 NA's :25 NA's :25 NA's :25 NA's :25
## ea.grad.prof.deg MI_Estimate tot.pop CT_<10
## Min. : 0.0 Min. : 2499 Min. : 0 Min. : 0.0
## 1st Qu.: 4.7 1st Qu.: 25068 1st Qu.: 2805 1st Qu.: 6.0
## Median : 8.7 Median : 31500 Median : 3868 Median : 10.5
## Mean : 12.1 Mean : 34406 Mean : 4006 Mean : 13.5
## 3rd Qu.: 16.4 3rd Qu.: 40680 3rd Qu.: 5037 3rd Qu.: 17.4
## Max. :100.0 Max. :204750 Max. :39373 Max. :100.0
## NA's :25 NA's :71 NA's :93
## CT_10-14 CT_15-19 CT_20-24 CT_25-29 CT_30-34
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 7.9 1st Qu.: 9.8 1st Qu.: 9.2 1st Qu.: 3.2 1st Qu.: 8.2
## Median : 12.2 Median : 14.3 Median : 13.5 Median : 5.7 Median : 12.7
## Mean : 13.6 Mean : 15.2 Mean : 14.1 Mean : 6.3 Mean : 13.3
## 3rd Qu.: 17.9 3rd Qu.: 19.7 3rd Qu.: 18.3 3rd Qu.: 8.7 3rd Qu.: 17.6
## Max. :100.0 Max. :100.0 Max. :100.0 Max. :64.6 Max. :100.0
## NA's :93 NA's :93 NA's :93 NA's :93 NA's :93
## CT_35-44 CT_45-59 CT_>60 ES_Total_labor_force
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 3.1 1st Qu.: 3.4 1st Qu.: 3.5 1st Qu.: 57.3
## Median : 6.0 Median : 6.9 Median : 7.0 Median : 63.7
## Mean : 6.8 Mean : 8.0 Mean : 9.3 Mean : 62.7
## 3rd Qu.: 9.6 3rd Qu.: 11.3 3rd Qu.: 12.6 3rd Qu.: 69.3
## Max. :100.0 Max. :100.0 Max. :100.0 Max. :100.0
## NA's :93 NA's :93 NA's :93 NA's :17
## ES_Civilian_labor_force ES_Civilian_labor_force_employed
## Min. : 0.0 Min. : 0.0
## 1st Qu.: 57.0 1st Qu.: 53.1
## Median : 63.4 Median : 59.9
## Mean : 62.2 Mean : 58.7
## 3rd Qu.: 69.0 3rd Qu.: 65.6
## Max. :100.0 Max. :100.0
## NA's :17 NA's :17
## ES_Civilian_labor_force_unemployed ES_Armed_Forces ES_Not_in_labor_force
## Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 1.8 1st Qu.: 0.0 1st Qu.: 30.7
## Median : 3.0 Median : 0.0 Median : 36.3
## Mean : 3.6 Mean : 0.4 Mean : 37.3
## 3rd Qu.: 4.6 3rd Qu.: 0.0 3rd Qu.: 42.7
## Max. :33.4 Max. :99.4 Max. :100.0
## NA's :17 NA's :17 NA's :17
## land urban poverty no.ins disab
## Min. : 0 Min. : 0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 1 1st Qu.: 3 1st Qu.: 6.6 1st Qu.: 4.1 1st Qu.: 9.3
## Median : 2 Median : 2989 Median : 12.1 Median : 7.3 Median : 12.6
## Mean : 47 Mean : 2793 Mean : 15.3 Mean : 9.1 Mean : 13.5
## 3rd Qu.: 10 3rd Qu.: 4368 3rd Qu.: 20.9 3rd Qu.:12.3 3rd Qu.: 16.7
## Max. :85426 Max. :31777 Max. :100.0 Max. :74.7 Max. :100.0
## NA's :18 NA's :18 NA's :146 NA's :29 NA's :111
## no.comp broad&comp no.eng sing.mom
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 3.4 1st Qu.: 76.4 1st Qu.: 0.0 1st Qu.: 7.3
## Median : 6.8 Median : 85.0 Median : 1.6 Median : 11.3
## Mean : 8.5 Mean : 82.6 Mean : 4.6 Mean : 13.3
## 3rd Qu.: 11.6 3rd Qu.: 91.6 3rd Qu.: 5.4 3rd Qu.: 17.3
## Max. :100.0 Max. :100.0 Max. :100.0 Max. :100.0
## NA's :164 NA's :164 NA's :164 NA's :164
## live.alone pub.assist no.phone no.plumb married.kid
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 20.4 1st Qu.: 0.8 1st Qu.: 0.8 1st Qu.: 0.0 1st Qu.: 33.4
## Median : 27.1 Median : 1.8 Median : 1.7 Median : 0.9 Median : 40.7
## Mean : 28.1 Mean : 2.7 Mean : 2.2 Mean : 2.4 Mean : 41.2
## 3rd Qu.: 34.6 3rd Qu.: 3.6 3rd Qu.: 3.1 3rd Qu.: 3.0 3rd Qu.: 48.8
## Max. :100.0 Max. :53.5 Max. :100.0 Max. :100.0 Max. :100.0
## NA's :164 NA's :171 NA's :164 NA's :163 NA's :213
## hhd.no.comp hhd.only.phone hhd.no.int hhd.broad
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 5.9 1st Qu.: 2.1 1st Qu.: 8.7 1st Qu.: 54.1
## Median : 10.9 Median : 4.7 Median : 15.6 Median : 68.0
## Mean : 12.5 Mean : 5.9 Mean : 17.4 Mean : 65.7
## 3rd Qu.: 17.2 3rd Qu.: 8.4 3rd Qu.: 23.8 3rd Qu.: 79.4
## Max. :100.0 Max. :100.0 Max. :100.0 Max. :100.0
## NA's :164 NA's :164 NA's :164 NA's :164
## index_apr20
## Min. :0.360
## 1st Qu.:0.855
## Median :0.885
## Mean :0.883
## 3rd Qu.:0.916
## Max. :1.194
##
We confirm that there are null values in the data set. To remove rows with null values in the columns listed below, we use the "na.omit" function. This function removes rows with any missing values in the specified columns. After removing the rows with null values, the resulting dataset would only contain rows with complete data in the specified columns.
add comment here
colSums(is.na(num_df))
num_df <- na.omit(num_df)
Next, we are handling outliers in the numerical data. We apply the outlierKD2 function from the ezids package to detect and remove the outliers in the dataset. We loop through all columns and remove outliers in each column by calling the outlierKD2 function. We remove any resulting NAs and update the data frame after each column. Finally, we remove the last two columns which were added by outlierKD2. We believe that handling outliers is essential in creating an accurate model because outliers can have a significant impact on the results, especially in regression models, and can lead to poor performance and inaccurate predictions.
new_num_df <- outlierKD2(num_df, num_df[[1]], rm=TRUE, boxplt=F, histogram=F,qqplt=F)
new_num_df <- new_num_df[,1:(ncol(new_num_df)-2)]
# loop through all columns
for (col_name in colnames(num_df)[2:ncol(num_df)]) {
# remove outliers
new_num_df <- outlierKD2(new_num_df, new_num_df[[col_name]], rm=T, boxplt=F, histogram=F,qqplt=F)
new_num_df <- na.omit(new_num_df)
new_num_df <- new_num_df[,1:(ncol(new_num_df)-2)]
}
new_num_df <- new_num_df[,1:(ncol(new_num_df)-1)]
After cleaning, the new data set contains 57 rows and 57 columns. The follow table show the summary statistics after cleaning.
summary(new_num_df)
The box plots show the distribution of various health, demographic, and socioeconomic. The plots suggest that obesity, diabetes, and depression rates are relatively high, while education levels and median household income are moderate. The plots also reveal that rural areas have lower population densities and higher poverty rates compared to urban areas. The prevalence of single mothers, those living alone, and individuals receiving public assistance is higher in certain tracts.
# with outliers
# par(mfrow=c(8, 8))
# par(mar=c(1, 1, 1, 1))
# for (i in 1:ncol(num_df)) {
# boxplot(num_df[[i]], main=names(num_df)[i])
# }
# without outliers
par(mfrow=c(8, 8))
par(mar=c(1, 1, 1, 1))
for (i in 1:ncol(new_num_df)) {
boxplot(new_num_df[[i]], main=names(new_num_df)[i])
}
We create scatter plots to visualize the relationship between depression and other variables, as well as poor mental health and other variables. The scatter plots show the distribution of each variable and the trend of the relationship between the variables. The data used in the plots come from a survey that collected information on various demographic, socioeconomic, and health-related factors.
Based on the scatter plot analysis of depression, it appears that depression is positively associated with “OBESITY,” “PHLTH,” and “ea.hs.deg.” On the other hand, there seems to be a negative association between depression and “MI_Estimate,” “ea.grad.prof.deg,” “CT_>60,” and “hhd.broad,” indicating that as these variables increase, the level of depression decreases. The strength of these associations varies, with “OBESITY,” “PHLTH,” and “MI_Estimate” having relatively stronger positive correlations with depression, while “ea.hs.deg” and “ea.grad.prof.deg” have weaker negative correlations.
theme_set(theme_pubr(base_size = 3.5))
plot_list <- list()
for (col in names(new_num_df)) {
if (col != "DEPRESSION") {
plot_data <- data.frame(x = new_num_df[[col]], y = new_num_df$DEPRESSION)
plot <- ggplot(plot_data, aes(x = x, y = y)) +
geom_point(size=0.3) +
geom_smooth(method = "lm", se = FALSE) +
xlab(col) +
theme(legend.position = "none")
plot_list[[col]] <- plot
}
}
ggarrange(plotlist = plot_list, ncol = 7, nrow = 8)
The scatter plots with a fitted line for poor mental health (MHLTH) and the above highly correlated variables show that MHLTH is positively correlated with PHLTH, OBESITY, and poverty. The scatter plot for PHLTH and MHLTH shows a clear positive linear relationship.
# Scatter plots with fit line for mhlth and other dependent variables
theme_set(theme_pubr(base_size = 3.5))
plot_list <- list()
for (col in names(new_num_df)) {
if (col != "MHLTH") {
plot_data <- data.frame(x = new_num_df[[col]], y = new_num_df$MHLTH)
plot <- ggplot(plot_data, aes(x = x, y = y)) +
geom_point(size=0.3) +
geom_smooth(method = "lm", se = FALSE) +
xlab(col) +
theme(legend.position = "none")
plot_list[[col]] <- plot
}
}
ggarrange(plotlist = plot_list, ncol = 7, nrow = 8)
The purpose of this section is to determine the distribution of the dependent variables, depression rate and poor mental health rate, to confirm which model would be appropriate for further analysis. The results show that both depression rate and poor mental health rate are bell-shaped, indicating a normal distribution. This is confirmed through the use of histograms and qqplots, with both variables showing a close-to-straight-line pattern. The normal distribution of the dependent variables meets the assumption of linear regression, indicating that a linear regression model can be used for further analysis. This finding is significant as it allows for the exploration of the relationship between the dependent variables and the independent variables. The use of linear regression can provide valuable insight into potential risk factors for depression and poor mental health, allowing for the development of targeted interventions to improve mental health outcomes.
# Histograms
hist(new_num_df$DEPRESSION, main = "Distribution of Tract-level Depression Rates")
hist(new_num_df$MHLTH, main = "Distribution of Tract-level Poor Mental Health Rates")
# QQ plots for the distribution of tract-level depression rates and tract-level poor mental health rates.
qqnorm(new_num_df$DEPRESSION, main = "Distribution of Tract-level Depression Rates")
qqline(new_num_df$DEPRESSION)
qqnorm(new_num_df$MHLTH, main = "Distribution of Tract-level Poor Mental Health Rates")
qqline(new_num_df$MHLTH)
Add comment here
# Create a correlation matrix
cor_matrix <- cor(new_num_df)
# Create two lists which have the names of variables highly correlated (more then 0.3 or less than -0.3)
high_dep_cor_list <- names(which(cor_matrix["DEPRESSION",] > 0.35 | cor_matrix["DEPRESSION",] < -0.35))
high_mhlth_cor_list <- names(which(cor_matrix["MHLTH",] > 0.5 | cor_matrix["MHLTH",] < -0.5))
high_dep_cor_list <- high_dep_cor_list[high_dep_cor_list != "MHLTH"]
high_mhlth_cor_list <- high_mhlth_cor_list[high_mhlth_cor_list != "DEPRESSION"]
length(high_dep_cor_list)
length(high_mhlth_cor_list)
add comment here
# Create two correlation matrixes from new_num_df in the above lists
high_dep_cor_mat <- cor(new_num_df[high_dep_cor_list])
high_mhlth_cor_mat <- cor(new_num_df[high_mhlth_cor_list])
high_dep_cor_mat
high_mhlth_cor_mat
# Plot above correlation matrix
corrplot(high_dep_cor_mat, type = "lower", outline.color = "white",
colors = c("#6D9EC1", "white", "#E46726"), legend.title = "Correlation",
ggtheme = theme_gray, title = "Highly Correlated Variables with DEPRESSION")
corrplot(high_mhlth_cor_mat, type = "lower", outline.color = "white",
colors = c("#6D9EC1", "white", "#E46726"), legend.title = "Correlation",
ggtheme = theme_gray, title = "Highly Correlated Variables with mhlth")
my_col <- colorRampPalette(c("red", "white", "blue"))(30)
corrplot(high_dep_cor_mat, method = "number", type = "lower", order = "hclust", diag = FALSE, tl.col = "black", tl.cex = 0.8, cl.cex = 0.8, addCoef.col = "black", col = my_col, main = "", mar = c(0,0,0,0))
corrplot(high_mhlth_cor_mat, method = "number", type = "lower", order = "hclust", diag = FALSE, tl.col = "black", tl.cex = 0.8, cl.cex = 0.8, addCoef.col = "black", col = my_col, main = "", mar = c(0,0,0,0))
add comment here
calculate_VIF <- function(data, target_col) {
X <- data[, !colnames(data) %in% target_col]
vif <- data.frame(
Feature = colnames(X),
VIF = apply(X, 2, function(x) vif(lm(x ~ ., data=X)))
)
return(vif)
}
depression_VIF <- calculate_VIF(new_num_df[high_dep_cor_list], "DEPRESSION")
mhlth_VIF <- calculate_VIF(new_num_df[high_mhlth_cor_list], "MHLTH")
depression_VIF
mhlth_VIF
Add comment here
dep_features_list <- high_dep_cor_list[high_dep_cor_list != "DEPRESSION"]
mhlth_features_list <- high_mhlth_cor_list[high_mhlth_cor_list != "MHLTH"]
dep_features_list
mhlth_features_list
Add comment here
add comment here
library(caret)
set.seed(123)
fold <- floor(runif(nrow(new_num_df),1,11))
new_num_df$fold <- fold
test.set <- new_num_df[new_num_df$fold == 1,]
train.set <- new_num_df[new_num_df$fold != 1,]
add comment here
Add comment here
# Perform two-way ANOVA test
depression_anova_model <- anova(depression_model_1, depression_model_2, depression_model_3)
mhlth_anova_model <- anova(mhlth_model_1, mhlth_model_2, mhlth_model_3)
xkabledply(depression_anova_model)
xkabledply(mhlth_anova_model)
health.socioecon.dep.anova <- anova(depression_model_2, depression_model_3)
health.socioecon.mhlth.anova <- anova(mhlth_model_2, mhlth_model_3)
xkabledply(health.socioecon.dep.anova)
xkabledply(health.socioecon.mhlth.anova)
We performed a stepwise model using the AIC criterion to identify the optimal combination of features. AIC accounts for the number of parameters in the model, which helps to prevent overfitting and avoid overly complex models.
# Perform stepwise selection using AIC as the selection criterion
depression_stepwise_model <- step(depression_model_1, direction = "both", trace = 0, k = log(nrow(train.set)), criterion = "AIC")
mhlth_stepwise_model <- step(mhlth_model_1, direction = "both", trace = 0, k = log(nrow(train.set)), criterion = "AIC")
# Print the final selected model
summary(depression_stepwise_model)
summary(mhlth_stepwise_model)
The stepwise model had fewer variables and demonstrated similar performance as the initial model when compared using an ANOVA test. The residual sum of squares (RSS) for the depression and poor mental health models were almost identical. Therefore, we chose the stepwise model over the initial model as it was more parsimony.
dep_anova_model_2 <- anova(depression_model_1, depression_stepwise_model)
mhlth_anova_model_2 <- anova(mhlth_model_1, mhlth_stepwise_model)
xkabledply(dep_anova_model_2)
xkabledply(mhlth_anova_model_2)
This section summarizes a summary of the models we considered to be the best performing based on the model selection to date.
For depression rates, our model identified five significant predictor variables: “OBESITY Rate”, “Poor physical health”, “high school graduate as highest level educational attainment”, “CT_>60”, and “percent of households interacting with the internet”. The model explained 37.6% of the variability in depression levels, which is a moderate amount.
For poor mental health rates, our model identified seven significant predictor variables: “LPA”, “Obesity”, “PHLTH”, “Sleep”, “ea.graduate”, “MI_income”, and “poverty”. The model explained 67.6% of the variability in poor mental health levels, indicating a strong relationship between the predictor variables and poor mental health.
Overall, our findings suggest that demographic, socio-economic, and health-related variables play a significant role in predicting depression and poor mental health rates in the United States. Our models can serve as a basis for further research on this topic and can assist policymakers in identifying high-risk populations and designing targeted interventions.
dep_predictions <- predict(depression_model_1, newdata = test.set)
# calculate the R-squared value
r_squared <- summary(depression_model_1)$r.squared
# calculate the mean squared error (MSE)
mse <- mean((test.set$DEPRESSION - dep_predictions)^2)
# print the results
cat("R-squared:", r_squared, "\n")
cat("MSE:", mse, "\n")
mhlth_predictions <- predict(mhlth_stepwise_model, newdata = test.set)
# calculate the R-squared value
r_squared <- summary(mhlth_stepwise_model)$r.squared
# calculate the mean squared error (MSE)
mse <- mean((test.set$MHLTH - mhlth_predictions)^2)
# print the results
cat("R-squared:", r_squared, "\n")
cat("MSE:", mse, "\n")
Add comment here
# Building a Regression Tree
# Step 1. Use recursive binary splitting to grow a large tree of depression on the training data
train.dep.tree <- rpart(DEPRESSION ~ ACCESS2 + BINGE + CHECKUP + DIABETES + LPA + OBESITY + PHLTH + SLEEP + STROKE + mt.nev.mar + mt.now.mar + MT_Divorces + MT_Separated + MT_Widowed + ea.less.hs.deg + ea.hs.deg + ea.col.ass.deg + ea.ba.deg + ea.grad.prof.deg + MI_Estimate + tot.pop + `CT_<10` + `CT_10-14` + `CT_15-19` + `CT_20-24` + `CT_25-29` + `CT_30-34` + `CT_35-44` + `CT_45-59` + `CT_>60` + ES_Total_labor_force + ES_Civilian_labor_force + ES_Civilian_labor_force_employed + ES_Civilian_labor_force_unemployed + ES_Armed_Forces + ES_Not_in_labor_force + land + urban + poverty + no.ins + disab + no.comp + `broad&comp` + no.eng + sing.mom + live.alone + pub.assist + no.phone + no.plumb + married.kid + hhd.no.comp + hhd.only.phone + hhd.no.int + hhd.broad + index_apr20, data = train.set, method = "anova")
# Grow large tree of mhlth on the training data
train.mhlth.tree <- rpart(MHLTH ~ ACCESS2 + BINGE + CHECKUP + DIABETES + LPA + OBESITY + PHLTH + SLEEP + STROKE + mt.nev.mar + mt.now.mar + MT_Divorces + MT_Separated + MT_Widowed + ea.less.hs.deg + ea.hs.deg + ea.col.ass.deg + ea.ba.deg + ea.grad.prof.deg + MI_Estimate + tot.pop + `CT_<10` + `CT_10-14` + `CT_15-19` + `CT_20-24` + `CT_25-29` + `CT_30-34` + `CT_35-44` + `CT_45-59` + `CT_>60` + ES_Total_labor_force + ES_Civilian_labor_force + ES_Civilian_labor_force_employed + ES_Civilian_labor_force_unemployed + ES_Armed_Forces + ES_Not_in_labor_force + land + urban + poverty + no.ins + disab + no.comp + `broad&comp` + no.eng + sing.mom + live.alone + pub.assist + no.phone + no.plumb + married.kid + hhd.no.comp + hhd.only.phone + hhd.no.int + hhd.broad + index_apr20, data = train.set)
# Plot tree model
fancyRpartPlot(train.dep.tree, main = "Depression Rate Tree")
fancyRpartPlot(train.mhlth.tree, main = "Poor Mental Health Rate Tree")
# Print summary
summary(train.dep.tree)
summary(train.mhlth.tree)
# Generate predicted values for test set
test.set$pred.depression <- predict(train.dep.tree, newdata = test.set)
# Calculate evaluation metrics
MSE <- mean((test.set$DEPRESSION - test.set$pred.depression)^2)
R2 <- 1 - sum((test.set$DEPRESSION - test.set$pred.depression)^2) / sum((test.set$DEPRESSION - mean(test.set$DEPRESSION))^2)
# Print evaluation metrics
cat("R-squared:", R2, "\n")
cat("MSE:", MSE, "\n")
# The R-squared value of 0.399 suggests that the regression tree model explains 39.9% of the variation in the data, which is a moderate level of explanation.
# The MSE (Mean Squared Error) of 6.05 represents the average squared difference between the predicted values and the actual values, with a higher value indicating worse performance.
test.set$pred.mhlth <- predict(train.mhlth.tree, newdata = test.set)
# Calculate evaluation metrics
MSE <- mean((test.set$MHLTH - test.set$pred.mhlth)^2)
R2 <- 1 - sum((test.set$MHLTH - test.set$pred.mhlth)^2) / sum((test.set$MHLTH - mean(test.set$MHLTH))^2)
# Print evaluation metrics
cat("R-squared:", R2, "\n")
cat("MSE:", MSE, "\n")
# The regression tree model with MHLTH as the target variable has an R-squared value of 0.685, indicating that 68.5% of the variability in the MHLTH variable can be explained by the model.
# The mean squared error (MSE) is 1.07, which means that on average, the model's predictions are off by 1.07 units from the actual values. .
# The root mean squared error (RMSE) is 1.04, which is the same as the standard deviation of the residuals.
# This indicates that the model's predictions have a standard deviation of 1.04 around the actual values. Overall, these metrics suggest that the model has a moderate level of predictive power for the MHLTH variable.
Need edited
Need edited
Need edited
Need edited
Need edited
Need edited # Reference links to datasets: https://chronicdata.cdc.gov/500-Cities-Places/PLACES-Local-Data-for-Better-Health-Census-Tract-D/cwsq-ngmh https://gisgeography.com/us-county-map/ https://www.ers.usda.gov/data-products/county-level-data-sets/county-level-data-sets-download-data/ https://data.census.gov/ https://www.anl.gov/dis/county-economic-impact-index https://www.census.gov/programs-surveys/community-resilience-estimates/data/datasets.html https://github.com/NREL/hsds-examples